-
Notifications
You must be signed in to change notification settings - Fork 27k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Uniformize model processors (models *with* special arg names) #32841
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution! I think handling the text_pair
etc kwargs is a bit more challenging, we should be able to use it solely with TypedDicts and without adding extra args
if output_kwargs["text_kwargs"].get("text_pair") is not None and audio is not None: | ||
raise ValueError( | ||
"You cannot provide `text_pair` as a positional argument and as a keyword argument at the same time." | ||
"Please provide it only as a keyword argument (i.e. `text_pair=...`)." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, it is not robust to rely on audio
here - or at least we have to explicitly comment why we are doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's what we also do in UDOP. Maybe a comment explaining is enough as audio arg-name isn't expected in Nougat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yup, I just followed the code here: #32544 (comment)
without this, old code that passed the args as positional arguments would break
I'll add comments & warnings for now, but lemme guys know what you think is best
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented on the other PR as well. I missed it for Udop but I don't think it's a good precedent unfortunately - it's a bit hacky as it stands, trying to see if we can do it some other way that's cleaner
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising this! I have a new approach which I described in more detail here: #31911 (comment)
pls lemme know what you think abt it!
831e64c
to
257c690
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this awesome work!
The workaround seems a bit hacky but I guess we need that for BC. Can we add tests for the newly added prepare_and_validate_optional_call_arg
in the models that support extra args? And let's add a version when we'll stop handling BC for users, it can be until v4.47
which gives two major versions for users
src/transformers/processing_utils.py
Outdated
raise ValueError( | ||
f"Expected *at most* {len(self.optional_call_args)} optional positional arguments in processor call but received {len(args)}." | ||
"Passing positional arguments to the processor call is not recommended" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This message imo isn't very informative for users who have no idea what is happening under the hood. Maybe we could show optional_call_args
names and say we couldn't map positional args to those
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just changed the error message
@zucchini-nlp can you check if the new one's okay?
text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None, | ||
images: Optional[ImageInput] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another thing we're doing now is swap the arg order, so that it is image, text, audio, videos
. And that needs another deprecation cycle...
BTW, i am quite out of the loop, do we need this order-swapping for pipeline @yonigozlan ?
@@ -232,6 +233,7 @@ def thumbnail( | |||
The channel dimension format of the input image. If not provided, it will be inferred. | |||
""" | |||
input_height, input_width = get_image_size(image, channel_dim=input_data_format) | |||
size = get_size_dict(size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not clear why we needed these changes, was this causing CI failure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are utils for old processors that don't support the new image size format yet
we might as well add these here since (1) they help w/ backwards compatibility, (2) make the image-text-to-text pipeline easier to implement, & (3) they just revert to a no-op if size
already follows the new image size format
text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None, | ||
images: Optional[ImageInput] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, and also, these two should be swapped so that the order is image, text, kwargs
text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None, | ||
images: Optional[ImageInput] = None, | ||
# The following is to capture `visual_prompt` argument that may be passed as a positional argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here for swapping
text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] | ||
text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] | ||
text_pair_target: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be added here and not be treated as model-specific args?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they're in the base tokenizer class so I figured it's okay to put them here
lemme know if I should remove them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally for me it's okey to have it here, as it is a general arg accepted by all tokenizers, though not used by all processors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, that will remove the need for special args for Udop so that's great :)
I'll also swap the This PR should now be ready for review |
Hi @leloykun ! Pinging this as blocking requirements for this PR should all be completed soon! Notably the function to check that the images and text inputs are in the correct order is merged. Here's how it is used in LlaVa for example. Also I think your |
Hi again @leloykun , I opened a PR with some of your changes to processing_utils.py and test_processing_common.py here #33479 . If you think you will have the bandwidth to work on this, feel free to open a PR and I'll close mine. In any case, thanks a lot for your contributions on this! |
What does this PR do?
[text, images, audio, videos]
and not config values for the tokenizer, image processor, etc.(arguments that carry data
TODO:
Fixes # (issue)
Who can review?
@zucchini-nlp @molbap @NielsRogge